
Reprinted from EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT






Educational and Psychological Measurement 
1968, 28, 403-412. 



THE INTERPRETATION OF WRONG ANSWERS FROM
A MULTIPLE CHOICE TEST 

J. C. POWELL 
University of Alberta 

With an ever-increasing emphasis in our schools and clinics on 
the individualization of treatment, there are several aspects of our 
diagnostic procedures which require re-examination. The multiple 
choice test is a useful general screening device. However, for 
more detailed information we still resort to prolonged personal 
contact, often at considerable expense of time and effort. Is it 
possible that the precision of our screening devices can be im- 
proved to facilitate diagnosis and hence prognosis and treatment? 

This paper will concern itself with one limited aspect of this 
problem; namely, the functional role of the "wrong answers" on 
multiple choice tests. Much time is spent by the examiner in the 
preparation of foils for multiple choice tests. A proportionally 
large time is spent by the examinee in making his selection 
decisions among the alternatives. In spite of the time thus spent, the 
foils are generally treated as a mask to the right answer and are 
lumped together in a general "wrong" category. The rating of
the examinee is usually entirely dependent on his total number 
of correct items on any given test or subtest. On the other hand,
if a multiple choice test has been well prepared, particular "wrong"
answers may have discriminating power nearly equivalent to that of
the "right" answers.

If foils do not have discriminating power other than to act as 
a mask for the "right" answer, then it is reasonable to assume that 
the selection of these foils by the examinee should be roughly 
random in distribution. This randomness assumption suggests several
hypotheses. First, if examinees are asked to give reasons for
their selection, there should be little more than chance equivalence
among their stated reasons on any particular foil selection for all 
those individuals who select this particular foil. 

Second, although some foils may hang together factorially be- 
tween items, there should be no logical consistency between stated 
reasons within any particular statistical factor. 

Third, if examinees were to be subdivided into two groups,
there should be little if any consistency between the reasons given
by the group on whose responses the statistical determinations
were made and the reasons given by the second group for selecting
these same foils.

Fourth, if two or more individuals choose the same pair of foils, 
that is, both individuals make identical errors on at least two 
different items, we will call this equivalence overlap. To maintain 
the random assumption, there should be no relationship between 
the proportion of overlap of each individual's "errors" with the 
errors of all other individuals and their rank on the total-correct 
score which they receive. 

Fifth, if we were to subdivide the examinees into five groups 
on the basis of percentile rank with respect to total-correct score, 
the random assumption suggests that for all such groups, the 
selection of all foils on any item generally should not be signifi- 
cantly different from random. 

Sixth, although some of the distributions in case five will be
significant, the placement of these significant deviations from
random should show no consistent similarity across the wrong
answer variables in the same factor.

Although these six relationships do not exhaust the possibilities
which the assumption of the random selection of foils would
predict, they are sufficient: if a substantial proportion of these
predictions is refuted, the random assumption is also refuted.
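The overlap notion used in the fourth hypothesis can be made concrete with a short sketch. The data layout (a mapping from item number to the foil chosen in error), the helper names, and the choice to average the shared-error proportion over all other examinees are assumptions made for illustration; the text does not spell out the exact computation.

```python
def shared_errors(a, b):
    """Set of (item, foil) errors that two examinees have in common."""
    return set(a.items()) & set(b.items())

def has_overlap(a, b):
    """Overlap as defined in the text: identical errors on at least two items."""
    return len(shared_errors(a, b)) >= 2

def average_overlap(errors):
    """Mean proportion of each examinee's errors shared with each other examinee.

    This averaging rule is an assumption, not taken from the paper.
    """
    result = {}
    for name, own in errors.items():
        others = [e for n, e in errors.items() if n != name]
        if not own or not others:
            result[name] = 0.0
            continue
        props = [len(shared_errors(own, o)) / len(own) for o in others]
        result[name] = sum(props) / len(props)
    return result

# Hypothetical data: item number -> foil letter chosen in error.
errors = {
    "s1": {3: "a", 14: "c", 22: "a"},
    "s2": {3: "a", 22: "a"},
    "s3": {9: "c"},
}

print(has_overlap(errors["s1"], errors["s2"]))  # True: two identical errors
print(average_overlap(errors))
```

Under the random assumption, ranking examinees by such an overlap value should bear no relation to their rank on total-correct score.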

Procedure. In order to test these six hypotheses a commercially 
prepared multiple-choice test was used. This test, The Proverbs 
Test, by D. R. Gorham (1957), was used for two reasons. First, we
did not wish to use a test which might reflect the particular bias of 
the author. Second, the test was developed from the scoring of a 
projective version of the same test. This means that the foils are 
based on the actual responses of disturbed persons and should, for
this reason, reflect random selection if any foil set is going to do
this.

The test was administered to two classes of upperclassmen in 
a course in educational measurement. The members of one of 
these two classes (N = 18) will be called Group A. The members
of the other class will be called Group B (N = 23).

The following analyses were conducted: 

1. A binary matrix of the correct responses to every item by 
every subject in Group A was truncated to remove items 
with zero or near zero variance. The remaining matrix was 
factor analysed. 

2. A binary matrix of the particular foil selections among each 
of the three possible wrong responses to every item by every 
subject was also truncated to remove variables of zero or 
near zero variance. The remaining matrix was factor analysed. 
The criterion for truncation in both these instances was
based on the response ratio (D):

$$D = \frac{1}{N} \sum_{i=1}^{N} X_{ij}$$

where $X_{ij}$ is the binary value of the selection
of the jth alternative by the ith individual and N is the
number of individuals. Variables were retained when
.10 < D < .90.

3. Those foils which were related factorially were examined on 
the basis of the stated reasons for their selection to determine 
the possibility of consistency of these reasons within partic- 
ular foils, and among factorially related foils. This examina- 
tion was used to study the problems arising from the first 
two hypotheses mentioned above with reference to the random 
assumption for foil selection. 

4. The most frequently stated reason for the selection of a 
retained wrong answer in Group A became the variable which 
was used to predict that a semantically equivalent reason 
would be given for the same foil by anyone in Group B who 
selected it. A binary vector of "hits" and "misses" was developed,
and the number of hits was expressed as a proportion of the total
possible hits. This proportion was treated as an estimate of the
variance accounted for in cross-validation. This procedure was
designed to examine the third hypothesis.

5. The average overlap (as defined above) of each examinee's 
foil selection was calculated and the examinees were then 
ranked on the basis of descending order of these values. A 
rank order correlation coefficient was then calculated be- 
tween rank order of overlap and the corresponding rank 
order by total score correct for each examinee. This proce- 
dure refers to the fourth hypothesis. 

6. Groups A and B were combined and subdivided on the 
basis of rank order by total score correct with roughly one 
fifth of the combined group of examinees in each of these 
new subdivisions. Thus "UPPER 5" (see Table 3) refers to 
those individuals who received total correct scores in the 
top 20 per cent of the entire group. A chi square test for 
independence was done across all the foils of every item 
which had one of its foils as one of the retained foil variables. 
This procedure was used to examine the last two of our 
random assumption hypotheses. 
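The truncation rule of steps 1 and 2 can be sketched as follows. The function names and the toy matrix are hypothetical; only the bounds of .10 and .90 on the response ratio D come from the text.

```python
def response_ratios(matrix):
    """Column means of a binary subjects-by-variables matrix (D for each variable)."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def retained_columns(matrix, low=0.10, high=0.90):
    """Indices of variables whose response ratio lies strictly between the bounds."""
    return [j for j, d in enumerate(response_ratios(matrix)) if low < d < high]

# Hypothetical matrix: 5 subjects (rows) by 4 binary variables (columns).
matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]

print(response_ratios(matrix))   # [1.0, 0.4, 0.4, 0.0]
print(retained_columns(matrix))  # [1, 2]
```

Columns chosen by everyone or by no one carry zero variance and are dropped before factoring, which is the purpose of the rule.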

Results. The results of step one of the analysis left 32 of the 40 
items for factoring. A principal components analysis of these 
gave a very clear simple structure without rotation. There were 
three factors with eigenvalues greater than 1.00 with the first 
factor containing 29 of the items. When these 29 variables were 
used they represented a unifactor which accounted for 81.4 per
cent of the variance on the test as shown in Table 1. The nature 
of this test suggests that the reduced test is a "Translation" test. 
This would put it into the comprehension level of Bloom's (1956) 
Taxonomy. 

The foil selection pattern was subjected to a similar procedure. 
In this case there was no clear pattern in the original factor 
matrix. For this reason several varimax rotations were performed 
on different numbers of factors. The criterion sought was the 
highest number of variables with high loadings on any one factor 
and low loadings on all others. A four factor rotation gave the
best results by this criterion. The number of variables then
considered was reduced to those which showed simple structure.
Variables among wrong answers were also rejected if their right 
answer had been rejected in step one. This procedure left 15
foils distributed across the 29 items, with a total communality
of 9.115. Distributed across the 3 X 29 = 87 possible foils, this
accounts for 10.5 per cent of the variance, as is shown in
Table 1. Thus the combined results of both the right and the
wrong answers for the shortened 29 item test account for a
total of 91.9 per cent of the variance on this test.
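The foil percentage quoted above follows from simple arithmetic, reproduced here as a check. Only figures stated in the text are used; the variable names are illustrative.

```python
foil_communality = 9.115      # total communality of the 15 retained foils
possible_foils = 3 * 29       # three foils on each of the 29 retained items

foil_share = 100 * foil_communality / possible_foils
combined = 81.4 + round(foil_share, 1)   # right-answer share as stated in the text

print(round(foil_share, 1))   # 10.5
print(round(combined, 1))     # 91.9
```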

TABLE 1

Factor Patterns for Right and Wrong Answers

Right Answers: First Principal Axis, Hotelling Solution
(Test Items Retained)

Item(a)   Factor Loading
1c        .929
2d        .898
3d        .938
6c        .941
8c        .954
9b        .839
10c       .835
11d       .924
13c       .838
14d       .944
15c       .905
17a       .966
18b       .948
19d       .918
20c       .904
21c       .898
22c       .914
23c       .849
24c       .935
26b       .935
27b       .942
29c       .886
31b       .897
32a       .911
34c       .935
35c       .841
36b       .954
39b       .855
40c       .848
Total Communality   23.708

Wrong Answers: Varimax Rotation
(Alternatives Retained)

Item(a)   I      II     III    IV
3a        .590
14c       .732
22a       .900
27a       .518
29b       .765
13d              .916
15d              .647
19c              .916
23d              .668
9c                      .814
13a                     .682
18a                     .689
2b                             .713
20a                            .744
21a                            .619
Total Communality of Retained Alternates   9.115

Alternatives Dropped
(loadings in source order; factor columns not recoverable)

Item(a)              Loading(s)
10a                  .442
16a(b)               .874
16b                  .661  .416
23a                  .567
24d                  .442
25a(b), 25c, 31d     .594  .526  .801

a This number and letter combination refers to the corresponding
item and alternative on the Proverbs Test.

b Items 16 and 25 were eliminated from the test on right answer
reduction.




A careful examination of the reasons given for the selection of 
the foils for each of the 15 foil variables retained showed that 
about 60 per cent of the examinees gave semantically equivalent 
explanations for their selection. 

When these reasons were examined for the logic behind them
across all the members of a foil factor, a fairly clear psychological
interpretation of each factor emerged. The factors suggested the
following possible interpretations:

Factor I: In each of these cases the subject seems to be sim- 
plifying part of the information contained in the 
proverb. 

Factor II: In this case each of the subjects seems to be adding 
an additional bit of irrelevant information in ar- 
riving at the answer. 

Factor III: In this case there seems to be a substitution of 
meaning for one element in the answer. 

Factor IV: In this case the selected foil seems to be another 
proverb bearing little or no relationship to the 
proverb stated in the stem. 

These factor interpretations seem to be consistent with a logical 
analysis of the items themselves. Table 2 illustrates this point by 
selecting one item from each sub-test, giving the appropriate 
foil for the sub-test, and listing examples of the reasons for selec- 
tion, from both groups of examinees. 

When Group A's usual reasons were used to predict the reasons 
for foil selection given by Group B, it was found that Group B's 
reasons were semantically equivalent to the most frequent reasons 
in Group A 64 per cent of the time. This finding would imply
a cross-validation correlation coefficient of $r = \sqrt{.64} = .80$.
(Notice the equivalence indicated in Table 2.)
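The step from the hit rate to the correlation coefficient can be sketched as below. The hit/miss vector is hypothetical (the text reports only the 64 per cent rate, not the number of predictions), and treating the hit proportion as variance accounted for is the author's stated assumption.

```python
import math

def hit_rate(hits):
    """Proportion of Group B selections whose stated reason matched the prediction."""
    return sum(hits) / len(hits)

# Hypothetical hit/miss vector: 1 = semantically equivalent reason, 0 = not.
# 16 hits out of 25 gives the reported rate of .64; the count 25 is invented.
hits = [1] * 16 + [0] * 9

proportion = hit_rate(hits)
r = math.sqrt(proportion)   # proportion treated as variance accounted for

print(round(proportion, 2), round(r, 2))   # 0.64 0.8
```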

The rank order correlation between the proportion of overlap
and the rank order of total correct score was .47 which is signifi- 
cant at the .05 level. This finding implies that the errors of the 
higher performing students are more alike than those of the lower 
performing students. This situation is, of course, also true for 
their correct responses. 
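A rank order correlation of this kind can be computed as in the sketch below, which assumes Spearman's coefficient obtained as Pearson's r on the ranks, with tied values given their mean rank. The six pairs of values are hypothetical and are not the study's data.

```python
def ranks(values):
    """Rank values from 1 (largest) downward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation computed as Pearson's r on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical values for six examinees.
overlap = [0.50, 0.42, 0.40, 0.31, 0.35, 0.20]
total_correct = [27, 25, 26, 20, 22, 18]

print(round(spearman(overlap, total_correct), 3))   # 0.943
```

The sketch does not guard against a variable with no variation, in which case the denominator would be zero.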

TABLE 2

Sample Analysis of Items to Illustrate Classification of Foils

Factor I     Reduction of information to effect simplification
             of statement.
Example      Item 22. A ROLLING STONE GATHERS NO MOSS.
             Foil a. BE CONSISTENT.
Analysis     This stem can be described as a relationship between
             an object, an action, and a negative condition. The
             foil drops the negative aspect of the condition,
             changing the meaning. BE INCONSISTENT would be better,
             but in this case, the relative value judgement of
             "moss gathering" in the original is reversed to the
             usual connotation given this proverb.
Sample Responses
  First Group
             1. Not settled - rolling stone.
             2. Related rolling stone with consistent.
  Second Group
             1. Rolling stone - don't settle down.
             2. Consistent and no moss.
Comments:    In each case the reason listed seems to be more
             closely related to the dropping of the negative than
             to misunderstanding the proverb. In other words, this
             error is in the translation process rather than in
             the comprehension of the stem.

Factor II    Addition of irrelevant information.
Example      Item 13. THE WORST SPOKE IN A CART BREAKS FIRST.
             Foil d. IT TAKES A GOOD MAN TO KEEP ON TRYING.
Analysis     Here we have an object-event-condition statement
             again. The object is treated as a unit. In the foil
             this unitary characteristic is changed and the weak
             point isolated as separate from the whole; a value
             judgement is then added.
Sample Responses
  First Group
             1. A weak person will stop trying.
  Second Group
             2. No interpretable reasons in this group.
Comment:     Again the translation process seems to be the source
             of the error.

Factor III   Substitution of elements.
Example      Item 9. THE MORE COST, THE MORE HONOR.
             Foil c. THE HIGHER THE PRICE, THE BETTER A THING IS.
Analysis     In this case "honor" is translated as "better" rather
             than as "appreciate" in the correct response. This
             subtle shift in meaning leads to a statement about
             the quality of an object itself rather than the value
             attributed to it. Hence, we classify this error as
             being the product of inappropriate substitution in
             the translation process.
Sample Response
  First Group
             1. Cost implies price.
  Second Group
             2. Parallel parts in both elements.
Comment:     By restricting the meaning of the word "cost" when
             translating, a substitution for "honor" becomes
             necessary. There is a degree of misunderstanding of
             the stem in this example.

Factor IV    Replacement of proverb by one largely unrelated
             (Irrelevancy Fallacy).
Example      Item 20. THE USED KEY IS ALWAYS BRIGHT.
             Foil a. SOMETHING OLD IS BETTER THAN SOMETHING NEW.
Analysis     The term "used", used here in the sense "much used",
             is translated by substitution into "old", and the
             term "always bright" into "better than". Hence the
             end product has no similarity in meaning to the stem.
Sample Response
  First Group
             1. Implies sentiment for old things as does stem.
  Second Group
             2. The word "used" key to answer.
Comment:     In this factor we can argue with some safety that the
             errors arose from a failure to understand the stem.
             It is interesting to note that a clear lack of
             understanding accounts for less than two per cent of
             the variance on this test.

Finally, a chi square test of independence on the ranked groups
of the combined classes showed that the proportion of the foil
selection which is different from chance is considerably greater
than would be expected if foil selection were randomly distributed.
The .10 level of significance was used because with a
combined N of 41 which is then subdivided into five groups of
about eight each, such small numbers are involved that a more
stringent test would be unreasonable. Table 3 recapitulates the
results. It should be noted that the upper 5 approached the .10 level
for every case in Factor IV. The small number of cases in this
"Irrelevancy" factor has an adverse effect on the possibility of
obtaining stable significant results for this factor.
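One way to check whether a score group's foil choices on an item depart from random selection is sketched below. It is a chi-square goodness-of-fit test against equal expected counts for the three foils, which is one reading of the fifth hypothesis and not necessarily the computation behind Table 3 (the paper describes its test as a test for independence). The counts are hypothetical. With three foils there are 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-x/2).

```python
import math

def chi_square_uniform(counts):
    """Chi-square statistic for observed foil counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def p_value_df2(statistic):
    """Upper-tail probability of chi-square with 2 degrees of freedom (closed form)."""
    return math.exp(-statistic / 2)

# Hypothetical counts of the three foils of one item within one score group.
counts = [6, 1, 1]

stat = chi_square_uniform(counts)
p = p_value_df2(stat)

print(round(stat, 2), round(p, 3), p < 0.10)   # 6.25 0.044 True
```

With groups of about eight examinees the expected counts are very small, so the chi-square approximation is rough; this is consistent with the author's choice of the lenient .10 level.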

TABLE 3

Significance of Foil Selection for Wrong Answer Factors (a)

[Column alignment is only partly recoverable from the source. For
each factor the foils are listed, followed by the entries (b)
printed for each level of total score, in source order. A hyphen
marks a cell printed as a dash.]

FACTOR I (foils 3a, 14c, 22a, 27a, 29b)
Level of Total Score    Entries
Upper 5 and 4           -
3                       .05  .05  -  .10  -  .05  -
2                       .02  .02  .01  .02  .01
1

FACTOR II (foils 13d, 15d, 19c, 23d)
Level of Total Score    Entries
Upper 5 and 4           .10
3 and 2                 -  .05  -  -  -  .10  -  -
1                       .05  .10  .10  .05

FACTOR III (foils 9c, 13a, 18a)
Level of Total Score    Entries
Upper 5 and 4           -  -  -  .10
3 and 2                 -  .10
1                       .01  .10  .10

FACTOR IV (foils 2b, 20a, 21a)
Level of Total Score    Entries
Upper 5 and 4           .10
3                       .10  .02  .10
2                       .10
1                       .10

a Chi square test for independence.

b Only p of χ² < .10 are shown.

Discussion. All six of the hypotheses derived from the assumption
of random selection of foils must be rejected on the findings
of this study.

First, not only are the reasons people give for selecting a foil
consistent for that particular foil, but they are also consistent
across members of the same statistically determined factor and
also in a cross-validation with another group.

Second, the responses of examinees tend to become progressively
more alike in both right and wrong responses as their total correct
score increases.

Third, not only are particular foils inclined to identify particular
groups within specific percentile ranges, but the identification
of these groups is consistent across all members of the same
factor and relatively distinctive from other factors. It is probable
that this last finding identifies the basis upon which the factor
analysis separates the factors. If this is the case, then this
procedure may make it possible to differentiate between individuals
with the same total-correct score on the basis of differential foil
selection.

On this basis, we can state that the random assumption for the 
selection of foils for this particular multiple-choice test is not tenable.
Because of the nature of this test, this statement could have a 
broad generality. 

If foils on multiple choice tests contain information about the 
examinees, what is the value of this information in a clinical or 
educational setting for purposes of diagnosis, prognosis, treatment,
or determining the effectiveness of treatment? Any statement in
this area would be unduly speculative at the moment. Attempts 
to answer these questions might open fruitful avenues of enquiry. 

Summary. A commercially prepared multiple choice test, The 
Proverbs Test, was analyzed to determine the possible informa- 
tion content of the wrong answers. It was demonstrated that the 
assumption that wrong answers are randomly distributed does 
not hold in this case. Wrong answers clearly have psychological 
and statistical meaning. The implications of these findings could 
have great importance for test interpretation. Further research is 
necessary to define these implications. 

REFERENCES 

Bloom, B. S. (Ed.). A Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain. New York: David McKay, 1956.

Gorham, D. R. The Proverbs Test. Missoula, Montana: Psycho- 
logical Test Specialists, 1957. 








